Cortex kv panel improvements #371
Conversation
Distributor uses multiple kv stores - for global limits and ha-tracker, as well as reading from the ingester ring - so we need to narrow the panel to just the one it says it is showing. For consistency, do the same on the ingester panel, although currently ingesters only have one kv store. Note that the renamed recording rule will mean that dashboards show no data for latency prior to the change.
For HA-tracker, show which tenants are changing election. For ingester, show how many are active, leaving, etc.
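As a rough sketch (not the PR's exact code) of the ingester-state panel described in that commit: cortex_ring_members and its name/state labels exist in Cortex, but the panel title and scraping it via the distributor job are assumptions on my part, reusing the $.panel / $.queryPanel / $.jobMatcher helpers from the diff below.

.addPanel(
  $.panel('Ingesters by ring state') +
  $.queryPanel(
    // cortex_ring_members is reported by ring clients (here assumed to be the
    // distributors watching the ingester ring) with a state label: ACTIVE, LEAVING, etc.
    'sum by (state) (cortex_ring_members{name="ingester", %s})' % $.jobMatcher($._config.job_names.distributor),
    '{{state}}'
  )
)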
Created as Draft because the renamed recording rule will mean that dashboards show no data for latency prior to the change. Do we want to cope with this? We could add an extra three queries on the latency panels with the old name, but it will get quite ugly in the code. Discuss.
If it was the read/write latency, I would have said definitely yes. In this case it's the KV latency, so we can probably accept the fact that we'll "lose the history" when looking at this dashboard in the past (or over a large period), so to me it's OK.
.addPanel(
  $.panel('Elected replica changes / min') +
  $.queryPanel([
    'max by(exported_cluster, user)(increase(cortex_ha_tracker_elected_replica_changes_total{%s}[1m])) >0' % $.jobMatcher($._config.job_names.distributor),
This can be a pretty high-cardinality query. For example, I just tested it in a cluster with a few thousand tenants, and running it over "Last 12h" took 25s on cold caches. What if we add a recording rule for that? I would expect the > 0 to narrow down the cardinality a lot (on top of the max by() already squashing it by a factor of the number of distributor replicas).
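For reference, a minimal sketch of what such a recording rule could look like, assuming the mixin's usual record/expr rule objects; the rule name and the set of labels kept are illustrative, not an agreed naming:

{
  // Hypothetical rule: pre-computes the per-tenant elected-replica changes
  // (already filtered with > 0) so the dashboard query stays cheap.
  record: 'cluster_namespace_user:cortex_ha_tracker_elected_replica_changes:increase1m',
  expr: |||
    max by (cluster, namespace, exported_cluster, user) (
      increase(cortex_ha_tracker_elected_replica_changes_total[1m])
    ) > 0
  |||,
},

The panel would then query the recorded series instead of the raw counter.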
What this PR does:
Distributor uses multiple kv stores - for global limits and ha-tracker, as well as reading from the ingester ring - so we need to narrow the panel to just the one it says it is showing, e.g. kv_name="distributor-hatracker". For consistency, do the same on the ingester panel, although currently ingesters only have one kv store.
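As a rough illustration of that narrowing (not the PR's exact code), a distributor KV panel could add the kv_name matcher along these lines; the panel title is made up, and the query is shown against the raw KV client histogram rather than the latency recording rules the dashboards actually use:

.addPanel(
  $.panel('KV store requests / sec (HA tracker)') +
  $.queryPanel(
    // kv_name keeps only the HA-tracker KV client, excluding the ring and
    // limits KV clients the distributor also runs.
    'sum by (operation) (rate(cortex_kv_request_duration_seconds_count{%s, kv_name="distributor-hatracker"}[1m]))' % $.jobMatcher($._config.job_names.distributor),
    '{{operation}}'
  )
)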
Created as Draft because the renamed recording rule will mean that dashboards show no data for latency prior to the change. Do we want to cope with this? We could add an extra three queries on the latency panels with the old name, but it will get quite ugly in the code. Discuss.
Also, I added a couple of panels with more info on what the KV store is doing.
Checklist
CHANGELOG.md updated